Comparison of Modeling Target in LSTM-RNN Duration Model
Abstract
Speech duration is an important component in statistical parametric speech synthesis (SPSS). In an LSTM-RNN based SPSS system, speech duration affects the quality of synthesized speech in two ways: through the prosody of the speech and through the position features used in the acoustic model. This paper investigates the effects of duration in LSTM-RNN based SPSS systems. The performance of acoustic models with position features at different levels is compared, and duration models with different network architectures are presented. A method is proposed that exploits the prior knowledge that the state durations of a phoneme should sum to the phone duration; it is shown to perform better in both state duration and phone duration modeling. The results show that the acoustic model with state-level position features performs better in acoustic modeling (especially in voiced/unvoiced classification), which means that the state-level duration model still has its advantage, and that duration models using the prior knowledge can yield better speech quality.
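The constraint mentioned in the abstract (state durations of a phoneme summing to the phone duration) can be illustrated with a small sketch. The abstract does not describe how the paper enforces it, so the following is not the authors' method. It is only one simple post-processing approach, assuming durations are counted in frames and that a model has already produced per-state predictions. The function name `fit_states_to_phone` and the five-state example are hypothetical.

```python
from typing import List


def fit_states_to_phone(state_frames: List[float], phone_frames: int) -> List[int]:
    """Rescale predicted state durations so they sum to phone_frames.

    Illustrative only: proportional rescaling followed by largest-remainder
    rounding. This is an assumed post-processing step, not the paper's method.
    """
    if not state_frames:
        raise ValueError("state_frames must not be empty")
    if phone_frames < 0:
        raise ValueError("phone_frames must be non-negative")

    total = sum(state_frames)
    if total <= 0:
        # No usable prediction: split the phone duration evenly.
        scaled = [phone_frames / len(state_frames)] * len(state_frames)
    else:
        # Keep the predicted proportions, change only the overall scale.
        scaled = [d * phone_frames / total for d in state_frames]

    # Round down first, then hand the leftover frames to the states
    # with the largest fractional parts.
    frames = [int(s) for s in scaled]
    leftover = phone_frames - sum(frames)
    by_fraction = sorted(
        range(len(scaled)), key=lambda i: scaled[i] - frames[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        frames[i] += 1
    return frames


# Hypothetical predictions for a five-state phoneme and a 14-frame phone.
adjusted = fit_states_to_phone([2.4, 3.1, 5.2, 2.9, 1.9], 14)
print(adjusted, sum(adjusted))
```

Rescaling after prediction keeps the two models independent. A method that builds the constraint into training, as the abstract suggests the paper does, would presumably behave differently.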
Similar resources
Discrete Duration Model for Speech Synthesis
The acoustic model and the duration model are the two major components in statistical parametric speech synthesis (SPSS) systems. The neural network based acoustic model makes it possible to model phoneme duration at phone-level instead of state-level in conventional hidden Markov model (HMM) based SPSS systems. Since the duration of phonemes is a countable value, the distribution of the phone-le...
What to Do Next: Modeling User Behaviors by Time-LSTM
Recently, Recurrent Neural Network (RNN) solutions for recommender systems (RS) have become increasingly popular. The insight is that there exist some intrinsic patterns in the sequence of users’ actions, and RNNs have been shown to perform excellently when modeling sequential data. In traditional tasks such as language modeling, RNN solutions usually only consider the sequential order of obje...
Semi-Supervised Training in Deep Learning Acoustic Model
We studied the semi-supervised training in a fully connected deep neural network (DNN), unfolded recurrent neural network (RNN), and long short-term memory recurrent neural network (LSTM-RNN) with respect to the transcription quality, the importance data sampling, and the training data amount. We found that DNN, unfolded RNN, and LSTM-RNN are increasingly more sensitive to labeling errors. For ...
Advanced LSTM: A Study about Better Time Dependency Modeling in Emotion Recognition
Long short-term memory (LSTM) is normally used in recurrent neural networks (RNNs) as the basic recurrent unit. However, conventional LSTM assumes that the state at the current time step depends on the previous time step. This assumption constrains the time dependency modeling capability. In this study, we propose a new variation of LSTM, advanced LSTM (A-LSTM), for better temporal context modeling. We empl...
Investigation of Senone-based Long-Short Term Memory RNNs for Spoken Language Recognition
Recently, the integration of deep neural networks (DNNs) trained to predict senone posteriors with conventional language modeling methods has been proved effective for spoken language recognition. This work extends some of the senone-based DNN frameworks by replacing the DNN with the LSTM RNN. Two of these approaches use the LSTM RNN to generate features. The features are extracted from the rec...